Skip to content

[SPARK-56912][SQL] Simplify Cast to boolean codegen under ANSI mode#55937

Closed
gengliangwang wants to merge 2 commits into
apache:masterfrom
gengliangwang:SPARK-56912-cast-boolean
Closed

[SPARK-56912][SQL] Simplify Cast to boolean codegen under ANSI mode#55937
gengliangwang wants to merge 2 commits into
apache:masterfrom
gengliangwang:SPARK-56912-cast-boolean

Conversation

@gengliangwang
Copy link
Copy Markdown
Member

@gengliangwang gengliangwang commented May 17, 2026

What changes were proposed in this pull request?

Extend UTF8StringUtils.scala with toBooleanExact(UTF8String, QueryContext) and use it from Cast.scala for the ANSI String -> Boolean cast path (both eval and codegen), alongside the peer string ANSI helpers (toByteExact / toShortExact / toIntExact / toLongExact). The non-ANSI path keeps the inline if/else if/else evNull = true form because it has no error to throw.

Why are the changes needed?

Part of SPARK-56908 (umbrella). The ANSI String->Boolean cast emits an 8-line if (isTrueString) … else if (isFalseString) … else throw block in codegen. This PR collapses it to a one-line UTF8StringUtils.toBooleanExact(...) call.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite *CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite"

307/307 pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x

@gengliangwang
Copy link
Copy Markdown
Member Author

gengliangwang commented May 17, 2026


Stack overview (SPARK-56908 umbrella)

This PR is part of the SPARK-56908 codegen-simplification series. Current status:

Merged:

Open:

@gengliangwang
Copy link
Copy Markdown
Member Author

Audited this PR for the same lessons surfaced by @viirya and @cloud-fan on #55938 (and applied to #55934 / #55939):

  1. Are the helpers redundant with an existing Scala API? No — there's no existing Scala object that wraps StringUtils.isTrueString / isFalseString with the ANSI invalidInputSyntaxForBooleanError throw. The helper is net-new.
  2. Are the eval-path additions redundant? No — master ANSI string -> boolean was a 7-line inline block. The new ANSI branch is a one-line helper call; the non-ANSI branch is simplified by moving the ansiEnabled check out into the case clause.

So no changes needed on this PR for that review.

### What changes were proposed in this pull request?

Extend `CastUtils.java` with `stringToBooleanExact(UTF8String, QueryContext)`
and use it from `Cast.scala` for the ANSI `String -> Boolean` cast path
(both eval and codegen). The non-ANSI path keeps the inline
`if/else if/else evNull = true` form because it has no error to throw.

### Why are the changes needed?

Part of SPARK-56908 (umbrella). The ANSI String->Boolean cast emits an
8-line `if (isTrueString) … else if (isFalseString) … else throw` block
in codegen. This PR collapses it to a one-line `CastUtils
.stringToBooleanExact(...)` call.

### Does this PR introduce _any_ user-facing change?

No. The compiled behavior is identical; only the emitted Java source
text changes.

### How was this patch tested?

```
build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite \
  *AnsiCastSuite *TryCastSuite"
```

204/204 pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
@gengliangwang gengliangwang force-pushed the SPARK-56912-cast-boolean branch from 365cabe to b186a2a Compare May 28, 2026 05:57
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is part of the SPARK-56908 codegen-simplification stack (described in your first conversation comment). The ANSI String→Boolean cast previously emitted an 8-line if/else if/else throw block in codegen and a parallel block in the eval path. The PR collapses both into a single helper call.

Prior state. castToBoolean (eval) and castToBooleanCode (codegen) each had a single _: StringType case that branched on ansiEnabled to choose between throwing invalidInputSyntaxForBooleanError and assigning null / evNull = true. The ANSI throw path duplicated the helper-method machinery from peer ANSI casts.

Design approach. Add a stringToBooleanExact(UTF8String, QueryContext) helper that performs the isTrueStringisFalseString → throw sequence and call it from a new case _: StringType if ansiEnabled branch in both castToBoolean and castToBooleanCode. The non-ANSI branch keeps the inline if/else if/else because there's no error to throw — same separation that peer casts (castToByteCode, castToIntCode, etc.) already use.

Key design decision — where the helper lives. The PR places it in CastUtils.java next to the integral/fractional narrowing helpers, but the peer string→primitive ANSI helpers (toByteExact / toShortExact / toIntExact / toLongExact) live in UTF8StringUtils.scala. See the inline comment recommending the move — it preserves the string-source/numeric-source split that UTF8StringUtils and CastUtils already maintain, and keeps CastUtils's class Javadoc accurate.

Implementation sketch. Behavior is preserved across both eval and codegen paths: the helper executes the same isTrueString / isFalseString / throw sequence; the non-ANSI inline branch keeps the same isTrueString / isFalseString / null sequence. Existing test coverage (CastSuite / CastWithAnsiOnSuite / CastWithAnsiOffSuite / TryCastSuite, plus the postgreSQL/boolean.sql.out golden file) exercises both paths.

return d.changePrecision(precision, scale) ? d : null;
}

// ----- string -> boolean (ANSI: throw on invalid syntax) -----
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move stringToBooleanExact to UTF8StringUtils.scala — the peer string→primitive ANSI helpers (toByteExact / toShortExact / toIntExact / toLongExact) already live there. Renaming to toBooleanExact(UTF8String, QueryContext): Boolean for symmetry, and switching the codegen call site to UTF8StringUtils.getClass.getCanonicalName.stripSuffix("$") like the peers (e.g. castToByteCode at Cast.scala:2063), gives a fully consistent shape.

This also keeps the CastUtils.java class Javadoc accurate — its current "for ANSI overflow-checked casts" line and the per-row references[] rationale apply to the integral/fractional narrowing helpers but not to a string→boolean invalid-input check.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 83a3cbb112a:

  • Added UTF8StringUtils.toBooleanExact(UTF8String, QueryContext): Boolean next to the existing toByteExact / toShortExact / toIntExact / toLongExact peers.
  • Switched the eval and codegen call sites in Cast.scala (castToBoolean / castToBooleanCode) to use it via the getCanonicalName.stripSuffix("$") pattern, matching castToByteCode etc.
  • Removed stringToBooleanExact from CastUtils.java along with its UTF8String / StringUtils imports, so the class Javadoc's "ANSI overflow-checked casts" + per-row references[] framing stays accurate.

Updated the UTF8StringUtils class doc from "casting string to numeric values" to "casting string to primitive values under ANSI mode" so the new boolean helper is in scope.

lint-scala clean, 307/307 Cast tests pass.

…s.toBooleanExact, peer the other string ANSI helpers
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the PR description correct, @gengliangwang ? stringToBooleanExact in this PR?

Extend CastUtils.java with stringToBooleanExact(UTF8String, QueryContext) and use it from Cast.scala for the ANSI String -> Boolean cast path (both eval and codegen). The non-ANSI path keeps the inline if/else if/else evNull = true form because it has no error to throw.

@gengliangwang
Copy link
Copy Markdown
Member Author

Good catch, @dongjoon-hyun — the description was stale. The helper was moved/renamed in the follow-up commit (stringToBooleanExact in CastUtils.javaUTF8StringUtils.toBooleanExact, peering the other string ANSI helpers). I've updated the PR description to match the current code.

Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @gengliangwang and @cloud-fan .

@gengliangwang
Copy link
Copy Markdown
Member Author

@cloud-fan @dongjoon-hyun Thanks for the review. Merging this one to master/4.x

gengliangwang added a commit that referenced this pull request May 29, 2026
### What changes were proposed in this pull request?

Extend `UTF8StringUtils.scala` with `toBooleanExact(UTF8String, QueryContext)` and use it from `Cast.scala` for the ANSI `String -> Boolean` cast path (both eval and codegen), alongside the peer string ANSI helpers (`toByteExact` / `toShortExact` / `toIntExact` / `toLongExact`). The non-ANSI path keeps the inline `if/else if/else evNull = true` form because it has no error to throw.

### Why are the changes needed?

Part of SPARK-56908 (umbrella). The ANSI String->Boolean cast emits an 8-line `if (isTrueString) … else if (isFalseString) … else throw` block in codegen. This PR collapses it to a one-line `UTF8StringUtils.toBooleanExact(...)` call.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite *CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite"
```

307/307 pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x

Closes #55937 from gengliangwang/SPARK-56912-cast-boolean.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 634cda0)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants